1 Introduction

Energy efficiency is now widely recognized as a crucial
aspect of embedded systems, in which a huge number of small
and very specialized autonomous devices interact through many
kinds of media (wired/wireless networks, Bluetooth, GSM/GPRS,
infrared, ...). Moreover, the uniprocessor paradigm will no
longer hold in those devices: even today, many mobile phones
are already equipped with several processors.

In this ongoing work, we are interested in energy-efficient
multiprocessor systems where task durations are not known in
advance, but are known stochastically. More precisely, we
consider global scheduling algorithms for frame-based
multiprocessor stochastic DVFS (Dynamic Voltage and Frequency
Scaling) systems. Moreover, we consider processors with a
discrete set of available frequencies.

In the past few years, a lot of work has been done on
energy-efficient multiprocessor systems. Most of it considers
static partitioning strategies, meaning that a task is
assigned to a specific processor, and each instance of this
task runs on the same processor. The first of those works were
devoted to deterministic tasks (with a task duration known
beforehand, or with the worst case considered), such as
im El in 13, and later probabilistic models were also
considered [7], [6]. Only a little work addresses global
scheduling, such as [3], but for deterministic systems, or
[9], which uses some slack reclamation mechanism but does not
really use stochastic information.

As far as we know, no work has addressed global scheduling of
stochastic tasks. We propose to work in this direction. Notice
that the frame-based model we consider in our work, where
every task shares the same period, is
also used by many researchers, such as El |3] |6l ID . 

2 Model 

We consider n sequential tasks τ_1, ..., τ_n. Task τ_i requires
x cycles with probability c_i(x), and its maximum number of
cycles is w_i (Worst Case Execution Cycles, or WCEC). The
number of cycles a task requires is not known before the end
of its execution. We consider a frame-based model, where all
tasks are synchronous and share the same deadline and period
D. In the following, D denotes the frame length. Those tasks
run on m identical CPUs Π_1, ..., Π_m, and each of those CPUs
can run at M frequencies f_1, ..., f_M.
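As an illustration of this model, a task can be represented by its discrete cycle distribution. The following is only a minimal sketch: the class name `StochasticTask`, its fields, and the numeric values are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StochasticTask:
    """Hypothetical container for a task tau_i of the model."""
    name: str
    cycle_probabilities: dict  # maps a number of cycles x to c_i(x)

    def __post_init__(self):
        # The values c_i(x) must form a probability distribution.
        total = sum(self.cycle_probabilities.values())
        if abs(total - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to 1")

    @property
    def wcec(self):
        """w_i: largest number of cycles with non-zero probability."""
        return max(x for x, p in self.cycle_probabilities.items() if p > 0)

    def worst_case_time(self, frequency):
        """Worst-case duration when running at a given frequency."""
        return self.wcec / frequency


# Illustrative values only (cycles, and cycles per time unit).
task = StochasticTask("tau_1", {100: 0.5, 200: 0.3, 400: 0.2})
frequencies = [50.0, 100.0, 200.0]  # f_1, ..., f_M
```

With these values, the worst case of the task is 400 cycles, which takes 2 time units at the highest frequency.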

We consider that tasks cannot be preempted, but different
instances of the same task can run on different processors,
i.e., task migrations are allowed. We consider global
scheduling techniques which schedule a queue of tasks: each
time a CPU is available, it picks up the first task in the
queue, chooses a frequency, and runs the job. We assume the
system is expedient¹ and the job order has been chosen
beforehand, but in some cases, in order to ensure
schedulability, the scheduler can adapt that order. In other
words, we assume that the initial task order is not crucial
and can be considered to be a soft constraint.

3 Global Scheduling Algorithm 

In [2], we provided techniques to schedule such a task set on
a single CPU. The main idea is to compute (offline) a
function giving, for each task, the frequency at which to run
the task based on the time elapsed in the current frame. This
function, S_i(t), gives the frequency at which τ_i should run
if started at time t in the current frame. Here, for the sake
of clarity, we consider the symmetric function of S_i:
S'_i(d) = S_i(D − d) gives the frequency for τ_i if this task
is started d units of time before the end of the frame.
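The relation between the two functions can be sketched as follows. This sketch assumes that S_i is stored as a step function given by breakpoints. That representation, the helper names, and the numeric values are assumptions made for illustration, and the computation of S_i itself (done in [2]) is not reproduced.

```python
from bisect import bisect_right


def make_step_function(breakpoints):
    """Build S_i from a sorted list of (start_time, frequency) pairs.

    Each frequency applies from its start_time until the next breakpoint.
    """
    starts = [t for t, _ in breakpoints]
    freqs = [f for _, f in breakpoints]

    def s(t):
        index = bisect_right(starts, t) - 1
        if index < 0:
            raise ValueError("t is before the first breakpoint")
        return freqs[index]

    return s


def symmetric(s, frame_length):
    """Return the function d -> s(D - d), d being the time left in the frame."""
    return lambda d: s(frame_length - d)


D = 10.0
S_i = make_step_function([(0.0, 50.0), (4.0, 100.0), (7.0, 200.0)])
S_sym = symmetric(S_i, D)
```

Here `S_sym(3.0)` looks up `S_i(7.0)`: a task started 3 time units before the end of the frame gets the frequency selected for a start at time 7.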

In the uniprocessor case, we were able to give schedulability
guarantees, as well as good energy consumption performance. We
want to provide both in the multiprocessor case, using a
global scheduling algorithm. As far as we know, global
scheduling of stochastic tasks on a multiprocessor system with
a limited number of available frequencies has not been
considered so far.

The idea of our scheduling algorithm is to consider that a
system with m CPUs and a frame length D is close to a system
with a single CPU and a frame length m × D, or to a system
with a frame length D that is m times faster. We therefore
first compute a set of n S'-functions considering the same set
of tasks, but a deadline m × D. A very naive approach would
consist in considering that when a task ends at time t, the
total remaining available time before the deadline is the sum
of the remaining time available on each CPU, which means D − t
on the current CPU, and D − t_p on each other CPU Π_p, where
t_p is the worst time at which the task currently running on
Π_p will end. Then, we could use S'_i(d) to choose the
frequency.
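The naive estimate of the remaining time described above amounts to a simple sum. The function name and the numeric values in this small sketch are illustrative assumptions.

```python
def remaining_system_time(frame_length, now, worst_end_times):
    """Sum of the time left on the current CPU and on every other CPU.

    worst_end_times holds t_q for each *other* CPU, i.e. the worst time at
    which its running task ends (assumed not to exceed the frame length).
    """
    current = frame_length - now
    others = sum(frame_length - t_q for t_q in worst_end_times)
    return current + others


# Three CPUs, D = 10: the current CPU is free at t = 2, the two others are
# busy until 5 and 7 in the worst case.
d = remaining_system_time(10.0, 2.0, [5.0, 7.0])
```

With these values, d = 8 + 5 + 3 = 16, which is also m × D minus the sum of t and the t_q values (30 − 14).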

Unfortunately, this simple approach does not work, because a
single task cannot use time on several CPUs simultaneously.
However, if the number of tasks is reasonably greater than the
number of CPUs, we think that in most cases, S'_i(d) will not
require more than the available time on the current CPU, and
will leave the available time on other CPUs for future tasks.
And when S'_i(d) requires more time than actually available,
we just use a faster frequency.

¹ An expedient system is a system where tasks never wait
intentionally. In other words, if a task is ready, the
processor cannot be idle.

Of course, we need to ensure the schedulability of the
system, which cannot be guaranteed with the previous approach:
for instance, at the end of a frame, we might have some slack
time that is unusable because it is too short to run any of
the remaining tasks. But as this time has been taken into
account when we chose the frequency of previous tasks, we
might miss the deadline if we do not take any precaution.

The algorithm we propose is composed of two phases, one
off-line and one on-line. The off-line phase consists in
performing a (virtual) static partitioning, aiming at
reserving enough time in the system for each task. This phase
is close to what we did in [2] with Danger Zones. The on-line
phase uses both this pre-reservation, to ensure
schedulability (while performing dynamic changes to the static
partitioning), and the S'-functions, to improve the energy
efficiency.



3.1 Virtual Static Partitioning 

We first perform a "virtual static partitioning". The aim of
this partitioning is not to assign a task to a processor, but
to make sure that every task can be executed. A task does not
have to run on its assigned processor, but we know that some
time has been reserved for this task, which guarantees
schedulability.

This static partitioning can be performed in many ways, but we
propose in Algorithm 1 to make it as balanced as possible, by
sorting tasks according to their WCEC.



Algorithm 1: Static partitioning

A_p = 0 ∀p;     // Reserved time on Π_p
T_p = {} ∀p;    // Tasks assigned to Π_p
foreach τ_i, sorted by descending w_i do
    q = argmin_p A_p;    // CPU with the largest not yet
                         // assigned time
    if D − A_q ≥ w_i / f_M then
        // τ_i reservation
        A_q += w_i / f_M;  T_q = T_q ∪ {τ_i};
    else
        Failed!

After this first step of virtual static partitioning, we can
see the system as in Figure 1 (left part). Notice that failing
to build this virtual partitioning does not mean that the
system is not schedulable. But if we manage to build it, then
we can ensure that the system is schedulable. This virtual
static partitioning can be computed offline, and used for the
whole life of the system.
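The virtual static partitioning can be sketched in a few lines. This is a minimal sketch assuming that each reservation equals the worst-case time at the highest frequency (w_i / f_M). The function name, the data layout, and the example values are hypothetical.

```python
def static_partitioning(wcec, num_cpus, frame_length, f_max):
    """Greedy virtual partitioning; returns None when the heuristic fails.

    wcec: dict mapping a task id to its worst-case execution cycles w_i.
    Returns (reserved, assigned), where reserved[p] is the time reserved on
    CPU p and assigned[p] is the list of task ids virtually placed on it.
    """
    reserved = [0.0] * num_cpus
    assigned = [[] for _ in range(num_cpus)]
    # Largest worst case first, each task on the least reserved CPU.
    for task in sorted(wcec, key=wcec.get, reverse=True):
        q = min(range(num_cpus), key=reserved.__getitem__)
        need = wcec[task] / f_max
        if frame_length - reserved[q] < need:
            return None  # no guarantee can be given with this heuristic
        reserved[q] += need
        assigned[q].append(task)
    return reserved, assigned


example_wcec = {1: 400, 2: 300, 3: 300, 4: 200, 5: 100}
result = static_partitioning(example_wcec, num_cpus=2,
                             frame_length=8.0, f_max=100.0)
too_short = static_partitioning(example_wcec, num_cpus=2,
                                frame_length=6.0, f_max=100.0)
```

In this example the five tasks fit in a frame of length 8 (7 and 6 time units reserved on the two CPUs), while the heuristic fails for a frame of length 6. As noted above, such a failure does not by itself prove that the system is not schedulable.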



Figure 1. Left: static partitioning. Right: state of the
system after having started tasks {τ_1, ..., τ_7}. Notice that
reservations (dashed tasks) correspond to worst cases, while
effective tasks (plain lines) are actual execution times, and
therefore change from frame to frame. The vertical axis is
frequency and the horizontal axis is time, so areas correspond
to amounts of computation.



3.2 On-line algorithm 

Based on the virtual static partitioning, the main idea of
the on-line part is to start a task at a frequency which
allows it to end before the beginning of the "reserved" part
of the frame. For instance, in Figure 1, τ_1 could start on
Π_1 using all the space between the beginning of the frame and
the reserved space for τ_5. But we will see situations where
the scheduler needs to give more time to τ_1. In such cases,
we can also move, for instance, τ_5 or τ_6 to Π_2, or τ_12 to
Π_3. By doing so, and because we never let a running task use
the reserved time of another (not started) task, we can
guarantee that, if we were able to build a partitioning in the
off-line phase, no task will ever miss its deadline. Of
course, as soon as a task starts, we release the reserved time
for this task.

The on-line part of the algorithm is given in Algorithm 4. We
first explain two procedures needed by the main algorithm.

3.2.1 MoveTasksOut 

This procedure (Algorithm 2) aims at moving tasks out of CPU
Π_p until enough space (the quantity s in the algorithm) is
available, or no task can be moved anymore. For instance, in
Figure 1 at time t = 0, we may want to run τ_1 on Π_1 at
frequency f_2. But according to the worst case of τ_1, we do
not have enough time to run this task between 0 and the
beginning of the reserved area of τ_5. However, we can move
τ_12 to Π_3, and τ_5 or τ_6 to Π_2.

While s units of time are not available, we take the largest
task on Π_p, and put it on the CPU with the largest free
space. This is of course a heuristic, since finding the
optimal choice is probably an NP-hard, or at least
intractable, problem.
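The heuristic just described can be sketched as follows. The `state` dictionary, its keys, and the example values are hypothetical choices made for this illustration. Reservations are assumed to be worst-case times at the highest frequency.

```python
def move_tasks_out(p, now, space, state):
    """Move reservations away from CPU p until `space` is free after `now`.

    state keys (all hypothetical names): 'D' frame length, 'f_max' highest
    frequency, 'wcec' task -> cycles, 'reserved' time reserved per CPU,
    'assigned' task ids per CPU, 'busy_until' worst end time per CPU.
    Returns True if enough space is available on CPU p at the end.
    """
    D, f_max, wcec = state["D"], state["f_max"], state["wcec"]
    reserved, assigned = state["reserved"], state["assigned"]
    busy_until = state["busy_until"]
    others = [q for q in range(len(reserved)) if q != p]

    def free(q):
        return D - reserved[q] - busy_until[q]

    # Each candidate is examined once, largest worst case first.
    for task in sorted(assigned[p], key=wcec.get, reverse=True):
        if D - reserved[p] - now >= space or not others:
            break
        q = max(others, key=free)  # CPU with the largest free space
        need = wcec[task] / f_max
        if free(q) >= need:
            assigned[p].remove(task)
            reserved[p] -= need
            assigned[q].append(task)
            reserved[q] += need
    return D - reserved[p] - now >= space


state = {
    "D": 10.0, "f_max": 100.0,
    "wcec": {5: 300, 6: 200, 12: 100, 9: 400},
    "reserved": [6.0, 4.0, 0.0],
    "assigned": [[5, 6, 12], [9], []],
    "busy_until": [0.0, 0.0, 0.0],
}
ok = move_tasks_out(0, 0.0, 8.0, state)
```

In this example, two reservations are moved from the first CPU to the third one, after which 9 time units are free on the first CPU and the loop stops.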

Algorithm 2: MoveTasksOut

Data: processor Π_p, current time t, space to free s

// Move out tasks from Π_p until s units of time are free
// from t.
while D − A_p − t < s do
    τ_i = next task in T_p (sorted by decreasing w_i);
    if no such τ_i then
        break;
    q = argmax_{q ≠ p} (D − A_q − t_q);    // CPU with the
                         // maximal amount of available space
    if D − A_q − t_q ≥ w_i / f_M then
        // Enough place to move τ_i on Π_q
        T_p = T_p \ {τ_i};  A_p −= w_i / f_M;
        T_q = T_q ∪ {τ_i};  A_q += w_i / f_M;

3.2.2 MoveTaskIn

This procedure (Algorithm 3) aims at trying to move a task τ_i
assigned to some CPU Π_q to the CPU Π_p. The main idea is that
we first move out as many tasks as needed from Π_p (line 1),
until we have enough space to import τ_i (lines 2 to 6). If we
have not managed to get enough space, false is returned
(line 8). However, this algorithm is a heuristic, and is not
always able to find a solution, even when such a solution
exists.

For instance (see Figure 1, right part), at the end of τ_7,
we would like to start τ_8 on Π_1. But neither τ_9 nor τ_12
can be moved to another CPU, so our algorithm fails to find a
solution. However, a smarter algorithm could find out that by
swapping τ_9 and τ_8, τ_8 would be able to start on Π_1.
Notice that giving a solution in any solvable case is probably
also an NP-hard, or at least intractable, problem.

The procedure we give here is quite naive, and not very
efficient, but we leave a better algorithm for further
research. The naivety of this algorithm does not affect
schedulability at all: it just forces the system to accept
task order changes more often, which might degrade the energy
efficiency (S'-functions are computed according to the given
order) and the user satisfaction, if the user's preferences
are often not respected.
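The MoveTaskIn procedure can be sketched as follows. To keep the example short, the step that frees space on the target CPU is passed in as a callback (`make_room`), which is a hypothetical parameter standing for MoveTasksOut. The `state` layout and the values are also assumptions made for this illustration.

```python
def move_task_in(p, task, now, state, make_room):
    """Try to move the reservation of `task` to CPU p; return True on success.

    make_room(p, now, space, state) is expected to free space on CPU p when
    it can (role of MoveTasksOut); here it is supplied by the caller.
    """
    D, f_max = state["D"], state["f_max"]
    reserved, assigned = state["reserved"], state["assigned"]
    need = state["wcec"][task] / f_max

    make_room(p, now, need, state)
    if D - now - reserved[p] < need:
        return False  # not enough space, the reservation stays where it is

    # Locate the CPU currently holding the reservation and move it to p.
    q = next(c for c, tasks in enumerate(assigned) if task in tasks)
    assigned[q].remove(task)
    reserved[q] -= need
    assigned[p].append(task)
    reserved[p] += need
    return True


def no_room_maker(p, now, space, state):
    """Trivial stand-in that never moves anything."""
    return None


def fresh_state():
    return {"D": 10.0, "f_max": 100.0, "wcec": {8: 300, 9: 400},
            "reserved": [4.0, 3.0], "assigned": [[9], [8]]}


early, late = fresh_state(), fresh_state()
moved_early = move_task_in(0, 8, 2.0, early, no_room_maker)
moved_late = move_task_in(0, 8, 4.0, late, no_room_maker)
```

At time 2 the first CPU still has 4 free time units, so the reservation of task 8 (3 time units) can be imported. At time 4 only 2 units remain and the call returns false, leaving the state untouched.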



Algorithm 3: MoveTaskIn

Data: processor Π_p, task τ_i
Result: true if τ_i can be moved on Π_p, false otherwise

   // Move enough tasks from Π_p to let τ_i run
1  MoveTasksOut(Π_p, t, w_i / f_M);
2  if D − t − A_p ≥ w_i / f_M then
3      let q be such that τ_i ∈ T_q;
       // Move τ_i from Π_q to Π_p
4      T_q = T_q \ {τ_i};  A_q −= w_i / f_M;
5      T_p = T_p ∪ {τ_i};  A_p += w_i / f_M;
6      return true;
7  else
8      return false;



3.2.3 Main algorithm 

Here are the main steps of the procedure given in
Algorithm 4, which is called each time a CPU (say Π_p) is
available, at time t, with τ_i the next task to start. This
procedure will always start a task at a speed guaranteeing
deadlines, but not necessarily τ_i.

• Line 1: We first evaluate d, the remaining time we have
for τ_i, ..., τ_n: if t_q is the worst time at which Π_q is
going to be available (the time of the last start, plus the
worst case execution time of the current task at the chosen
frequency), we have:

    d = (D − t) + Σ_{q ≠ p} (D − t_q)
      = m × D − (t + Σ_{q ≠ p} t_q)

• Line 2: Let f = S'_i(d), the frequency chosen for τ_i in
the single CPU model with d units of time before the
deadline. We are going to check if we can use this frequency
(we assume this frequency to be a "good" one from the energy
consumption point of view).



• Lines 3 to 6: If τ_i was not assigned to Π_p, we first try
to move it to Π_p (Algorithm 3). If we have enough space on
Π_p, the situation is easy. Otherwise, we need to move some
tasks out of Π_p, in order to create enough space.

• Line 5: If we cannot manage to make enough space, then we
are not able to start τ_i right now. We then try the same
procedure for τ_{i+1}, but we need to left-shift the
S'-functions of j^. This is not required from the
schedulability point of view (we ensure schedulability by
controlling the available time), but we guess it will improve
the energy consumption. For the same reason, we will need to
right-shift the functions by the same amount when τ_i starts,
because we have one task less to run after τ_i. (This
improvement is not yet implemented in the given algorithm. It
needs to be done carefully, because we might have several
swapped tasks.)

• Line 9: If we succeeded, we try to move as many tasks as
possible from Π_p to other CPUs (Algorithm 2), until we have
enough space to start τ_i at f, or no task can be moved
anymore. We then start τ_i either at f, or at the smallest
frequency allowing τ_i to run in the space we managed to free
(line 11). As τ_i was assigned to Π_p (possibly after some
changes), we are at least sure that we can start τ_i at f_M.

Notice that when StartTask is invoked, it is always possible
to run a job, and therefore we will never consider τ_{n+1} in
Algorithm 4, line 5. Because of space limitations, we do not
give the proof here.
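The final frequency choice of the main algorithm (keep f if the worst case fits in the freed space, otherwise take the smallest frequency that fits) can be sketched as below. The function name and the values are illustrative assumptions. Returning `None` when even the fastest frequency does not fit is a convention of this sketch only, since the reservation mechanism is meant to exclude that case.

```python
def choose_frequency(desired, wcec, space, frequencies):
    """Pick the frequency at which to start a task.

    desired: frequency f suggested by the S'-function.
    wcec: worst-case execution cycles of the task.
    space: time available on the CPU before the first reservation.
    frequencies: the discrete set of available frequencies.
    """
    if wcec / desired <= space:
        return desired
    # Smallest available frequency whose worst case fits in the space.
    for f in sorted(frequencies):
        if wcec / f <= space:
            return f
    return None


available = [50.0, 100.0, 200.0]
kept = choose_frequency(50.0, 400, 10.0, available)
raised = choose_frequency(50.0, 400, 3.0, available)
impossible = choose_frequency(50.0, 400, 1.0, available)
```

With 400 cycles in the worst case, the desired frequency 50 fits in 10 time units (8 are needed), whereas with only 3 time units the frequency must be raised to 200.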

Algorithm 4: StartTask

Data: time t, processor Π_p, task τ_i

1   d = m × D − (t + Σ_{q ≠ p} t_q);  // Available time on the
                                      // system
2   f = S'_i(d);                      // Freq. we want to run at
3   if τ_i ∉ T_p then
        // τ_i is not on Π_p, we try to move it in
4       if not MoveTaskIn(Π_p, τ_i) then
5           StartTask(t, Π_p, τ_{i+1});
6           return;
    // We have now τ_i ∈ T_p
7   A_p −= w_i / f_M;                 // Release τ_i reservation
8   T_p = T_p \ {τ_i};
    // Try to remove enough tasks (if needed) from Π_p to allow
    // τ_i to run at the desired speed f
9   MoveTasksOut(Π_p, t, w_i / f);
10  if D − t − A_p < w_i / f then
        // Not enough time to run at freq f
11      f = ⌈w_i / (D − A_p − t)⌉_F;
12  t_p = t + w_i / f;                // Worst end time for τ_i
13  Start τ_i at f;

4 Work-in-progress

Here are a few points we want to investigate further, in
order to improve the energy consumption or the number of
systems we are able to schedule.



• At the end of a frame, assuming we can verify that after
the task we start, we will not run any more tasks on this CPU,
we can try to run tasks using the CPU until D. For instance,
if we start a task on Π_p at a speed which leaves a free space
[t_p, D] too small to run any of the remaining tasks, then we
should try to stretch the task to use Π_p up to D.

• If we accept changing the frequency during the execution of
tasks, we can use the continuous model to obtain a frequency
f, and use the two frequencies ⌈f⌉_F and ⌊f⌋_F to "emulate"
this f, where ⌈f⌉_F (resp. ⌊f⌋_F) stands for the smallest
frequency above (resp. largest below) f.

• Several steps require solving NP-hard problems by using
heuristics: static partitioning (Algorithm 1), MoveTaskIn
(Algorithm 3), and MoveTasksOut (Algorithm 2). The efficiency
of the first one determines the number of systems we can
accept to schedule, that of the second one, the number of
tasks we will need to swap (not run in the right order), and
that of the third one, how close we can stay to the
uniprocessor algorithm. We may try to improve those three
algorithms.

• In order to reduce leakage or static energy consumption, we
could turn off CPUs if they are not needed anymore before the
end of the frame.
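The two-frequency emulation mentioned in the second point above can be illustrated with a short computation. This sketch assumes that frequency switches are free and instantaneous, which the paper does not state. The function name and the values are hypothetical.

```python
def split_between_frequencies(f, cycles, frequencies):
    """Emulate a continuous frequency f with its two discrete neighbours.

    Returns a list of (frequency, duration) pairs that executes `cycles`
    cycles in exactly cycles / f time units.
    """
    lower = max((x for x in frequencies if x <= f), default=None)
    upper = min((x for x in frequencies if x >= f), default=None)
    if lower is None or upper is None:
        raise ValueError("f is outside the range of available frequencies")
    duration = cycles / f
    if lower == upper:
        return [(lower, duration)]
    # Solve upper * x + lower * (duration - x) = cycles for x.
    time_upper = duration * (f - lower) / (upper - lower)
    return [(upper, time_upper), (lower, duration - time_upper)]


plan = split_between_frequencies(80.0, 400.0, [50.0, 100.0, 200.0])
exact = split_between_frequencies(100.0, 400.0, [50.0, 100.0, 200.0])
```

For f = 80 and 400 cycles, the task runs 3 time units at frequency 100 and 2 time units at frequency 50, which gives 300 + 100 = 400 cycles in the same 5 time units as the continuous frequency would.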

Of course, we also, and mainly, need to validate our model
and show its efficiency by means of simulations, using
realistic environments and workloads.